Monitoring and tracking city crime records is important: it not only reveals how crimes are committed, but also gives the authorities a better way to analyze crime patterns across a city and to allocate security resources so that crime is reduced efficiently. In this tutorial, we therefore show how to analyze a San Francisco crime data set using the data science techniques we learned in CMSC320.
This project is based on the data set that Roshan Sharma provides on Kaggle. The tutorial is split into 3 parts:
Evaluating the data based on single attributes
Attribute analysis using interactive maps
Prediction and regression analysis
With this tutorial, we hope not only to teach readers how to analyze a data set with data science methods, but also to suggest, based on our analysis, what the authorities could do to reduce crime.
First, we need to load some libraries needed for our project:
library(rvest)
library(tidyverse)
library(tidyr)
library(lubridate)
library(dplyr)
library(leaflet)
library(stringi)
library(broom)
library(tree)
We download the data from Roshan Sharma’s Kaggle page and load it with read.csv:
data <- read.csv("Police_Department_Incidents_-_Previous_Year__2016_.csv")
head(data,10)
## IncidntNum Category Descript
## 1 120058272 WEAPON LAWS POSS OF PROHIBITED WEAPON
## 2 120058272 WEAPON LAWS FIREARM, LOADED, IN VEHICLE, POSSESSION OR USE
## 3 141059263 WARRANTS WARRANT ARREST
## 4 160013662 NON-CRIMINAL LOST PROPERTY
## 5 160002740 NON-CRIMINAL LOST PROPERTY
## 6 160002869 ASSAULT BATTERY
## 7 160003130 OTHER OFFENSES PAROLE VIOLATION
## 8 160003259 NON-CRIMINAL FIRE REPORT
## 9 160003970 WARRANTS WARRANT ARREST
## 10 160003641 MISSING PERSON FOUND PERSON
## DayOfWeek Date Time PdDistrict Resolution
## 1 Friday 01/29/2016 12:00:00 AM 11:00 SOUTHERN ARREST, BOOKED
## 2 Friday 01/29/2016 12:00:00 AM 11:00 SOUTHERN ARREST, BOOKED
## 3 Monday 04/25/2016 12:00:00 AM 14:59 BAYVIEW ARREST, BOOKED
## 4 Tuesday 01/05/2016 12:00:00 AM 23:50 TENDERLOIN NONE
## 5 Friday 01/01/2016 12:00:00 AM 00:30 MISSION NONE
## 6 Friday 01/01/2016 12:00:00 AM 21:35 NORTHERN NONE
## 7 Saturday 01/02/2016 12:00:00 AM 00:04 SOUTHERN ARREST, BOOKED
## 8 Saturday 01/02/2016 12:00:00 AM 01:02 TENDERLOIN NONE
## 9 Saturday 01/02/2016 12:00:00 AM 12:21 SOUTHERN ARREST, BOOKED
## 10 Friday 01/01/2016 12:00:00 AM 10:06 BAYVIEW NONE
## Address X Y
## 1 800 Block of BRYANT ST -122.4034 37.77542
## 2 800 Block of BRYANT ST -122.4034 37.77542
## 3 KEITH ST / SHAFTER AV -122.3889 37.72998
## 4 JONES ST / OFARRELL ST -122.4130 37.78579
## 5 16TH ST / MISSION ST -122.4197 37.76505
## 6 1700 Block of BUSH ST -122.4261 37.78802
## 7 MARY ST / HOWARD ST -122.4057 37.78088
## 8 200 Block of EDDY ST -122.4118 37.78398
## 9 4TH ST / BERRY ST -122.3934 37.77579
## 10 100 Block of CAMERON WY -122.3872 37.72097
## Location PdId
## 1 (37.775420706711, -122.403404791479) 1.200583e+13
## 2 (37.775420706711, -122.403404791479) 1.200583e+13
## 3 (37.7299809672996, -122.388856204292) 1.410593e+13
## 4 (37.7857883766888, -122.412970537591) 1.600137e+13
## 5 (37.7650501214668, -122.419671780296) 1.600027e+13
## 6 (37.788018555829, -122.426077177375) 1.600029e+13
## 7 (37.7808789360214, -122.405721454567) 1.600031e+13
## 8 (37.7839805592634, -122.411778295992) 1.600033e+13
## 9 (37.7757876218293, -122.393357241451) 1.600040e+13
## 10 (37.7209669615499, -122.387181635995) 1.600036e+13
The Kaggle page describes 12 attributes, listed below with their data types and descriptions. (The Descript column that appears in the output above is not part of this list.)
| Num | Name | Type | Description |
|---|---|---|---|
| 1 | IncidntNum | categorical | Incident number |
| 2 | Category | categorical unordered | Description of the crime |
| 3 | DayOfWeek | categorical unordered | Day of the week when the crime happened |
| 4 | Date | Datetime | Date |
| 5 | Time | Datetime | Time |
| 6 | PdDistrict | categorical unordered | District |
| 7 | Resolution | categorical unordered | Kind of punishment given to the criminal to resolve the case |
| 8 | Address | Geolocation | Address of the crime scene |
| 9 | X | Geolocation | Longitude of the crime location |
| 10 | Y | Geolocation | Latitude of the crime location |
| 11 | Location | Geolocation | Exact location as a (latitude, longitude) pair |
| 12 | PdId | other | Pd Id |
Let’s tidy the data:
* First, we drop the last attribute, PdId, because it is not useful for our analysis.
* Second, we clean up the dates. As you can see, the time part of the Date attribute is always 12:00:00 AM, so we strip it and parse Date and Time into lubridate date-time and period values.
* Third, for easier comparison, we extract the month from Date and the numeric hour from Time.
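tidy <- data %>%
  mutate(Time = hm(Time)) %>%             # parse "HH:MM" into a lubridate Period
  mutate(hour = hour(Time)) %>%           # numeric hour of the day (0-23)
  mutate(Date = mdy_hms(Date)) %>%        # parse "MM/DD/YYYY 12:00:00 AM" into a date-time
  mutate(Month = format(Date, "%m")) %>%  # two-digit month as a string
  select(-PdId)                           # drop the unused PdId column
head(tidy)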
## IncidntNum Category Descript
## 1 120058272 WEAPON LAWS POSS OF PROHIBITED WEAPON
## 2 120058272 WEAPON LAWS FIREARM, LOADED, IN VEHICLE, POSSESSION OR USE
## 3 141059263 WARRANTS WARRANT ARREST
## 4 160013662 NON-CRIMINAL LOST PROPERTY
## 5 160002740 NON-CRIMINAL LOST PROPERTY
## 6 160002869 ASSAULT BATTERY
## DayOfWeek Date Time PdDistrict Resolution
## 1 Friday 2016-01-29 11H 0M 0S SOUTHERN ARREST, BOOKED
## 2 Friday 2016-01-29 11H 0M 0S SOUTHERN ARREST, BOOKED
## 3 Monday 2016-04-25 14H 59M 0S BAYVIEW ARREST, BOOKED
## 4 Tuesday 2016-01-05 23H 50M 0S TENDERLOIN NONE
## 5 Friday 2016-01-01 30M 0S MISSION NONE
## 6 Friday 2016-01-01 21H 35M 0S NORTHERN NONE
## Address X Y
## 1 800 Block of BRYANT ST -122.4034 37.77542
## 2 800 Block of BRYANT ST -122.4034 37.77542
## 3 KEITH ST / SHAFTER AV -122.3889 37.72998
## 4 JONES ST / OFARRELL ST -122.4130 37.78579
## 5 16TH ST / MISSION ST -122.4197 37.76505
## 6 1700 Block of BUSH ST -122.4261 37.78802
## Location hour Month
## 1 (37.775420706711, -122.403404791479) 11 01
## 2 (37.775420706711, -122.403404791479) 11 01
## 3 (37.7299809672996, -122.388856204292) 14 04
## 4 (37.7857883766888, -122.412970537591) 23 01
## 5 (37.7650501214668, -122.419671780296) 0 01
## 6 (37.788018555829, -122.426077177375) 21 01
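To see what the parsing functions do in isolation, here is a minimal sketch that applies them to single values copied from the rows above. The variable names `time_value` and `date_value` are made up for the illustration. It also shows why row 5 prints as `30M 0S`: a period with zero hours is printed without the hour part, but its hour is still 0.

```r
library(lubridate)

# Time column: "HH:MM" strings become lubridate Period objects
time_value <- hm("14:59")
hour(time_value)         # hour component of the period

# A time shortly after midnight has an hour component of 0
hour(hm("00:30"))

# Date column: the constant "12:00:00 AM" part is parsed as midnight
date_value <- mdy_hms("04/25/2016 12:00:00 AM")
format(date_value, "%m") # two-digit month as a string
```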
table(tidy$Month)
##
## 01 02 03 04 05 06 07 08 09 10 11 12
## 12946 12092 12362 12317 12713 12076 12166 12428 12473 13331 12670 12926
We use a bar graph to visualize how the number of crimes differs between months. In the graph below, the bars do not differ much in height, but the bars for January and October are somewhat taller than the others, which means that the number of crimes in January and October was relatively higher than in other months. We could therefore suggest that the authorities add security and shifts especially in January and October.
tidy %>%
group_by(Month)%>%
summarize(num_incident = n()) %>%
ggplot(mapping = aes(x = Month, y = num_incident)) + geom_bar(stat = "identity")
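Months differ in length, so raw monthly totals are not strictly comparable. The following is a minimal sketch, not part of the original analysis, that divides the counts from the table above by the number of days in each month of 2016 (a leap year). The names `month_counts`, `days_2016` and `per_day` are made up for the illustration.

```r
# Monthly incident counts copied from table(tidy$Month)
month_counts <- c(12946, 12092, 12362, 12317, 12713, 12076,
                  12166, 12428, 12473, 13331, 12670, 12926)
names(month_counts) <- sprintf("%02d", 1:12)

# Days per month in 2016 (February has 29 days)
days_2016 <- c(31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

# Average number of incidents per day in each month
per_day <- month_counts / days_2016
round(per_day, 1)

# Month with the highest daily rate
names(per_day)[which.max(per_day)]
```

Comparing `per_day` with the raw counts is a quick way to check whether a tall bar reflects more crime or just a longer month.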
table(tidy$hour)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 6941 4359 3494 2553 1885 1744 2518 3894 5575 5865 6483 6786 9021 7268 7621 8329
## 16 17 18 19 20 21 22 23
## 8656 9559 9718 8981 8098 7480 7099 6573
We can also use a bar graph to see the distribution of all crimes by hour. As the plot and the table above show, in 2016 the fewest crimes were committed between 05:00 and 05:59 and the most between 18:00 and 18:59. Based on the boxplot, every hour from 01:00 to 11:59 falls below the median hourly count. Police could therefore schedule more security checks and shifts around the city between 12:00 and 00:59.
tidy %>%
group_by(hour) %>%
summarize(num_incident = n()) %>%
ggplot(mapping = (aes(x = hour, y = num_incident))) + geom_boxplot() + geom_bar(stat = "identity")
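The statements about the quietest and busiest hours can be checked directly from the table. A minimal sketch, using the hourly counts copied from the output above; `hour_counts` and `below_median` are names chosen for the illustration:

```r
# Hourly incident counts copied from table(tidy$hour), hours 0 to 23
hour_counts <- c(6941, 4359, 3494, 2553, 1885, 1744, 2518, 3894,
                 5575, 5865, 6483, 6786, 9021, 7268, 7621, 8329,
                 8656, 9559, 9718, 8981, 8098, 7480, 7099, 6573)
names(hour_counts) <- 0:23

# Hours with the fewest and the most incidents
names(hour_counts)[which.min(hour_counts)]
names(hour_counts)[which.max(hour_counts)]

# Hours whose count is below the median (the center line of a boxplot)
below_median <- names(hour_counts)[hour_counts < median(hour_counts)]
below_median
```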
sort(table(tidy$DayOfWeek))
##
## Sunday Monday Tuesday Wednesday Thursday Saturday Friday
## 20205 20783 21242 21332 21395 22172 23371
From the table, the counts do not differ much between days of the week; the three days with the most crimes are Friday, Saturday and Thursday. We use a bar graph to visualize the data. Based on this, we could suggest that the authorities add security and shifts around the weekend (starting on Friday).
tidy %>%
group_by(DayOfWeek) %>%
summarize(num_incident = n()) %>%
ggplot(mapping = aes(x = DayOfWeek, y = num_incident)) + geom_bar(stat = "identity")
sort(table(tidy$Category))
##
## TREA PORNOGRAPHY/OBSCENE MAT
## 3 4
## GAMBLING BAD CHECKS
## 20 34
## SEX OFFENSES, NON FORCIBLE LOITERING
## 40 42
## FAMILY OFFENSES EXTORTION
## 53 60
## BRIBERY SUICIDE
## 66 69
## RUNAWAY LIQUOR LAWS
## 140 156
## EMBEZZLEMENT KIDNAPPING
## 168 257
## ARSON DRIVING UNDER THE INFLUENCE
## 286 378
## DRUNKENNESS FORGERY/COUNTERFEITING
## 465 619
## PROSTITUTION DISORDERLY CONDUCT
## 641 658
## RECOVERED VEHICLE STOLEN PROPERTY
## 736 882
## SEX OFFENSES, FORCIBLE WEAPON LAWS
## 940 1658
## TRESPASS SECONDARY CODES
## 1812 1841
## FRAUD ROBBERY
## 2635 3299
## DRUG/NARCOTIC MISSING PERSON
## 4243 4338
## SUSPICIOUS OCC BURGLARY
## 5782 5802
## WARRANTS VEHICLE THEFT
## 5914 6419
## VANDALISM ASSAULT
## 8589 13577
## NON-CRIMINAL OTHER OFFENSES
## 17866 19599
## LARCENY/THEFT
## 40409
Looking at the table of categories, the top three categories in San Francisco are LARCENY/THEFT, OTHER OFFENSES and NON-CRIMINAL. A good way to visualize this is a pie chart, which shows the proportions and differences among all categories. As you can see, the largest slice, LARCENY/THEFT, accounts for more than a quarter of all incidents. We could therefore suggest that the authorities put more effort into preventing larceny and theft.
tidy %>%
group_by(Category) %>%
summarize(num_incident = n()) %>%
ggplot(aes(x="", y=num_incident, fill=Category)) +
geom_bar(stat="identity", width=1) +
coord_polar("y", start=0)
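With this many categories, a pie chart can be hard to read. As an alternative, here is a minimal sketch of a sorted horizontal bar chart for the five largest categories. The counts are copied from the table above, and the total of 150500 is the sum of the monthly counts shown earlier; `top_categories` and `category_plot` are names made up for the illustration.

```r
library(ggplot2)

# Five largest categories, copied from sort(table(tidy$Category))
top_categories <- data.frame(
  Category = c("LARCENY/THEFT", "OTHER OFFENSES", "NON-CRIMINAL",
               "ASSAULT", "VANDALISM"),
  num_incident = c(40409, 19599, 17866, 13577, 8589)
)

# Share of all incidents (assumes 150500 incidents in total)
top_categories$share <- top_categories$num_incident / 150500

# Horizontal bars, ordered by count so the largest category is on top
category_plot <- ggplot(top_categories,
                        aes(x = reorder(Category, num_incident),
                            y = num_incident)) +
  geom_col() +
  coord_flip() +
  labs(x = "Category", y = "Number of incidents")
category_plot
```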
In this part, we showed an easy, step-by-step way of preparing, tidying and visualizing the data. Doing this makes the differences and distributions of the attributes in the data set easy to understand.
A useful visualization for geographic data is the interactive map. Each incident has location coordinates, which lets us see the distribution of crime incidents in San Francisco. In this section, we want to understand whether certain areas of San Francisco have a higher crime rate and, if so, whether the rate is higher at particular times of day or on particular days of the week.
First, we set the map view using the latitudes and longitudes of San Francisco:
map <- leaflet(tidy) %>%
addTiles() %>%
setView(lat=37.7740, lng=-122.4313, zoom=11)
map
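Instead of hard-coding the center and zoom level, the view can also be derived from the data. A minimal sketch, assuming the bounding box is taken from the range of the X (longitude) and Y (latitude) columns; the coordinates are copied from a few of the sample rows above, and `sample_xy` and `map_fit` are names made up for the illustration:

```r
library(leaflet)

# A few coordinates copied from head(data, 10)
sample_xy <- data.frame(
  X = c(-122.4034, -122.3889, -122.4130, -122.4197, -122.4261),
  Y = c(37.77542, 37.72998, 37.78579, 37.76505, 37.78802)
)

# Fit the view to the bounding box of the points
map_fit <- leaflet(sample_xy) %>%
  addTiles() %>%
  fitBounds(lng1 = min(sample_xy$X), lat1 = min(sample_xy$Y),
            lng2 = max(sample_xy$X), lat2 = max(sample_xy$Y))
map_fit
```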
The following table shows how we interpret our data set in the interactive map:
| Color | Incident Time |
|---|---|
| yellow | 6am - 12pm |
| navy | 12pm - 6pm |
| red | 6pm - 12am |
| black | 12am - 6am |
| Color | Day of Week |
|---|---|
| red | Monday |
| orange | Tuesday |
| yellow | Wednesday |
| green | Thursday |
| blue | Friday |
| navy | Saturday |
| purple | Sunday |
Then, we need to set the elements to display the data, popup information, different colors for different time or day of week, and the icons.
This one is for Incident Time:
color <- function(tidy){
sapply(tidy$hour, function(hour){
if (as.integer(hour) >= 6 & as.integer(hour) < 12){
"yellow"
} else if (as.integer(hour) >=12 & as.integer(hour) < 18){
"navy"
} else if (as.integer(hour) >= 18){
"red"
} else {
"black"
}
})
}
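The `color` helper above checks each hour with an if/else chain. The same mapping can be sketched in vectorized form with base R's `cut`, which assigns each hour to a half-open interval. The function name `hour_color` is made up for the illustration; it assumes hours are integers from 0 to 23.

```r
# Map hours to marker colors using half-open intervals [0,6), [6,12), [12,18), [18,24)
hour_color <- function(hour) {
  as.character(cut(as.integer(hour),
                   breaks = c(0, 6, 12, 18, 24),
                   labels = c("black", "yellow", "navy", "red"),
                   right = FALSE))
}

# One hour from each end of every interval
hour_color(c(0, 5, 6, 11, 12, 17, 18, 23))
```

Unlike the if/else version, an hour outside 0 to 23 yields NA here instead of falling through to a default color.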
icons <- awesomeIcons(
icon = 'ios-close',
iconColor = 'black',
library = 'ion',
markerColor = color(tidy)
)
label <- paste("<b>Day of Week: </b>", tidy$DayOfWeek, "<br>",
"<b>Address: </b>", tidy$Address, "<br>",
"<b>Category: </b>", tidy$Category, "<br>",
"<b>Description: </b>", tidy$Descript, "<br>",
"<b>Resolution: </b>", tidy$Resolution, "<br>")
We use markers to represent each incident, colored by its incident time.
map <- map %>%
addAwesomeMarkers(
data = tidy,
lng = tidy$X,
lat = tidy$Y,
icon = icons,
popup = label,
clusterOptions = markerClusterOptions(),
group = 'time'
) %>%
addLegend(position = "bottomright", colors = c("yellow", "navy", "red", "black"),
labels = c("6am - 12pm", "12pm - 6pm", "6pm - 12am", "12am - 6am"),
title = "Different Incident Time", group = 'time')
This one is for Day of Week:
color2 <- function(tidy){
sapply(tidy$DayOfWeek, function(DayOfWeek){
if (stri_cmp(DayOfWeek, "Monday") == 0){
"red"
} else if (stri_cmp(DayOfWeek, "Tuesday") == 0){
"orange"
} else if (stri_cmp(DayOfWeek, "Wednesday") == 0){
"yellow"
} else if (stri_cmp(DayOfWeek, "Thursday") == 0){
"green"
} else if (stri_cmp(DayOfWeek, "Friday") == 0){
"blue"
} else if (stri_cmp(DayOfWeek, "Saturday") == 0){
"navy"
} else {
"purple"
}
})
}
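The chain of string comparisons in `color2` can also be written as a lookup in a named vector. A minimal sketch; `day_colors` and `day_color` are names made up for the illustration, and `as.character` is there in case DayOfWeek was read in as a factor.

```r
# Named vector: day of week -> circle color
day_colors <- c(Monday = "red", Tuesday = "orange", Wednesday = "yellow",
                Thursday = "green", Friday = "blue", Saturday = "navy",
                Sunday = "purple")

# Look up the color for each day; unname() drops the day names from the result
day_color <- function(day) {
  unname(day_colors[as.character(day)])
}

day_color(c("Monday", "Friday", "Sunday"))
```

Note one difference in behavior: `color2` returns "purple" for any value that is not Monday through Saturday, while this lookup returns NA for values that are not a day name.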
We use circles to represent each incident, colored by the day of the week on which it happened.
map <- map %>%
addCircleMarkers(
data = tidy,
lng = tidy$X,
lat = tidy$Y,
color = color2(tidy),
clusterOptions = markerClusterOptions(),
group = 'day'
) %>%
addLegend(position = "bottomleft", colors = c("red", "orange", "yellow", "green", "blue", "navy", "purple"),
labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"),
title = "Different Day of Week", group = 'day')
Then we combine the two layers in one map, with a control to toggle between them (the day layer is hidden at first):
map <- map %>%
addLayersControl(overlayGroups = c('time', 'day'), options = layersControlOptions(collapsed = FALSE)) %>%
hideGroup("day")
map
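The map answers the question of where and when incidents cluster visually. The same question can also be approached with a cross-tabulation of district against time of day. A minimal sketch on the ten sample rows shown at the start (district and hour copied from that output); `sample_rows` and `district_by_period` are names made up for the illustration, and the period labels follow the color table above.

```r
# District and hour of the ten rows from head(data, 10)
sample_rows <- data.frame(
  PdDistrict = c("SOUTHERN", "SOUTHERN", "BAYVIEW", "TENDERLOIN", "MISSION",
                 "NORTHERN", "SOUTHERN", "TENDERLOIN", "SOUTHERN", "BAYVIEW"),
  hour = c(11, 11, 14, 23, 0, 21, 0, 1, 12, 10)
)

# Assign each hour to one of the four periods used for the marker colors
sample_rows$period <- cut(sample_rows$hour,
                          breaks = c(0, 6, 12, 18, 24),
                          labels = c("12am - 6am", "6am - 12pm",
                                     "12pm - 6pm", "6pm - 12am"),
                          right = FALSE)

# Count incidents per district and period
district_by_period <- table(sample_rows$PdDistrict, sample_rows$period)
district_by_period
```

Ten rows are far too few to draw conclusions; applied to the full `tidy` data frame, the same table would show which districts and periods have the most incidents.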